$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$ $\newcommand{\amax}{{\text{argmax}}}$ $\newcommand{\P}{{\mathbb{P}}}$ $\newcommand{\E}{{\mathbb{E}}}$ $\newcommand{\R}{{\mathbb{R}}}$ $\newcommand{\Z}{{\mathbb{Z}}}$ $\newcommand{\N}{{\mathbb{N}}}$ $\newcommand{\C}{{\mathbb{C}}}$ $\newcommand{\abs}[1]{{ \left| #1 \right| }}$ $\newcommand{\simpl}[1]{{\Delta^{#1} }}$

Anomaly Detection via Density Estimation


import numpy as np
import itertools as it
from tqdm import tqdm

import matplotlib
from matplotlib import pyplot as plt
import plotly.express as px
import pandas as pd

import ipywidgets as widgets

from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \
perform_rkde_experiment, get_mnist_data

from ipywidgets import interact

from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import IsolationForest
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst

from tensorflow import keras

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (5, 5)

Anomaly Detection via Density Estimation

Idea: Estimate the density of $F_0$. Areas of low density are anomalous.

  • Often the anomaly proportion $p$ is too small to estimate the complete mixture model
  • Takes into account that $F_1$ might not be well-defined
  • Estimation procedure needs to be robust against contamination if no clean training data is available

Kernel Density Estimation

  • Non-parametric method
  • Can represent almost arbitrarily shaped densities
  • Each training point "spreads" a fraction of the probability mass as specified by the kernel function

Definition


Definition:

  • $K: \mathbb{R} \to \mathbb{R}$ kernel function
    • $K(r) \geq 0$ for all $r\in \mathbb{R}$
    • $\int_{-\infty}^{\infty} K(r) dr = 1$
  • $h > 0$ bandwidth
  • Bandwidth is the most crucial parameter

Definition:


Let $D = \{x_1,\ldots,x_N\}\subset \mathbb{R}^p$. The KDE with kernel $K$ and bandwidth $h$ is $\mathrm{KDE}_h(x, D) = \frac{1}{N}\sum_{i=1}^N \frac{1}{h^p}K\left(\frac{\|x-x_i\|}{h}\right)$

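To make the definition concrete, here is a minimal from-scratch sketch for the Gaussian kernel (the kernel's normalizing constant is folded into $K$). It is for illustration only and should agree with scikit-learn's KernelDensity up to numerical precision; the helper name kde_gaussian is ours, not part of any library.

import numpy as np
from sklearn.neighbors import KernelDensity

def kde_gaussian(x, data, h):
    """Evaluate a Gaussian KDE at the query points x: one kernel bump per training point, averaged."""
    p = data.shape[1]
    # squared distances ||x - x_i||^2 between every query point and every training point
    sq_dists = ((x[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    # normalized Gaussian kernel in p dimensions with bandwidth h
    bumps = np.exp(-sq_dists / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (p / 2)
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 2))
queries = rng.normal(size=(5, 2))

ours = kde_gaussian(queries, D, h=0.5)
sklearns = np.exp(KernelDensity(kernel='gaussian', bandwidth=0.5).fit(D).score_samples(queries))
print(np.allclose(ours, sklearns))  # True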

Effect of bandwidth and kernel

Exercise

Play with the parameters!

dists = create_distributions(dim=2, dim_irrelevant=0)

sample_train = dists['Double Blob'].sample(500)
X_train = sample_train[-1]
y_train = [0]*len(X_train)

plt.scatter(X_train[:,0], X_train[:,1], c = 'blue', s=10)
plt.show()
# Helper function
def fit_kde(kernel: str, bandwidth: float, X_train: np.ndarray) -> KernelDensity:
    """ Fit KDE
    
    @param kernel: kernel
    @param bandwidth: bandwidth
    @param X_train: training data
    """
    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth)
    kde.fit(X_train)
    return kde

def visualize_kde(kde: KernelDensity, bandwidth: float, X_test: np.ndarray, y_test: np.ndarray) -> None:
    """Plot KDE
    
    @param kde: KDE
    @param bandwidth: bandwidth
    @param X_test: test data
    @param y_test: test label
    """
    fig, axis = plt.subplots(figsize=(5, 5))

    lin = np.linspace(-10, 10, 50)
    grid_points = list(it.product(lin, lin))
    ys, xs = np.meshgrid(lin, lin)
    # The score function of sklearn returns log-densities
    scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
    colormesh = axis.contourf(xs, ys, scores)
    fig.colorbar(colormesh)
    axis.set_title('Density Contours (Bandwidth={})'.format(bandwidth))
    axis.set_aspect('equal')
    color = ['blue' if i ==0 else 'red' for i in y_test]
    plt.scatter(X_test[:, 0], X_test[:, 1], c=color)
    plt.show()

Choose KDE Parameters

ker = None
bdw = None
@interact(
    kernel=['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'],
    bandwidth=(.1, 10.)
)
def set_kde_params(kernel: str, bandwidth: float) -> None:
    """Helper funtion to set widget parameters
    
    @param kernel: kernel
    @param bandwidth: bandwidth
    """
    global ker, bdw

    ker = kernel
    bdw = bandwidth
kde = fit_kde(ker, bdw, X_train)
visualize_kde(kde, bdw, X_train, y_train)

Bandwidth Selection

The bandwidth is the most important parameter of a KDE model. A poorly chosen value leads to over- or under-smoothing of the density estimate.

A common method to select a bandwidth is maximum log-likelihood cross validation. $$h_{\textrm{llcv}} = \arg\max_{h}\frac{1}{k}\sum_{i=1}^k\sum_{y\in D_i}\log\left(\frac{k}{N(k-1)}\sum_{x\in D_{-i}}K_h(x, y)\right)$$ where $D_{-i}$ is the data without the $i$th cross validation fold $D_i$.
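In scikit-learn, KernelDensity.score already returns the log-likelihood of held-out data, so a plain grid search over the bandwidth reproduces this selection rule (up to the exact fold weighting). A minimal sketch on synthetic data, for illustration only:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X = np.random.default_rng(0).normal(size=(500, 2))

# scoring=None falls back to KernelDensity.score, i.e. the log-likelihood of the held-out fold
grid = GridSearchCV(
    KernelDensity(kernel='gaussian'),
    param_grid={'bandwidth': np.linspace(0.05, 2.0, 40)},
    cv=5,
)
grid.fit(X)
print('h_llcv ~', grid.best_params_['bandwidth'])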

Exercises

ex no.1: Noisy sinusoidal

# Generate example
dists = create_distributions(dim=2)

distribution_with_anomalies = contamination(
    nominal=dists['Sinusoidal'],
    anomaly=dists['Blob'],
    p=0.05
)

# Train data
sample_train = dists['Sinusoidal'].sample(500)
X_train = sample_train[-1].numpy()

# Test data
sample_test = distribution_with_anomalies.sample(500)
X_test = sample_test[-1].numpy()
y_test = sample_test[0].numpy()

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
handles, _ = scatter.legend_elements()
plt.legend(handles, ['Nominal', 'Anomaly'])
plt.gca().set_aspect('equal')
plt.show()

TODO: Define the search space for the kernel and the bandwidth

param_space = {
    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
}
def hyperopt_by_score(X_train: np.ndarray, param_space: dict, cv: int = 5):
    """Performs hyperoptimization by score
    
    @param X_train: data
    @param param_space: parameter space
    @param cv: number of cv folds
    """
    kde = KernelDensity()

    search = RandomizedSearchCV(
        estimator=kde,
        param_distributions=param_space,
        n_iter=100,
        cv=cv,
        scoring=None # use estimators internal scoring function, i.e. the log-probability of the validation set for KDE
    )

    search.fit(X_train)
    return search.best_params_, search.best_estimator_

Run the code below to perform hyperparameter optimization.

params, kde = hyperopt_by_score(X_train, param_space)

print('Best parameters:')
for key in params:
    print('{}: {}'.format(key, params[key]))

test_scores = -kde.score_samples(X_test)
test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)

curves = evaluate(y_test, test_scores)
.../site-packages/sklearn/model_selection/_search.py:922: UserWarning: One or more of the test scores are non-finite: [-583.75 -463.02 ... -inf ... -510.36]
  warnings.warn(
.../site-packages/sklearn/model_selection/_search.py:929: RuntimeWarning: invalid value encountered in subtract
  array_stds = np.sqrt(np.average((array -
Best parameters:
kernel: linear
bandwidth: 1.0
visualize_kde(kde, params['bandwidth'], X_test, y_test)

Exercise: Isolate anomalies in house prices

You are a company responsible for estimating house prices around Ames, Iowa, specifically in the College Creek area. But there is a problem: houses from a nearby neighborhood, 'Veenker', are often included in your dataset. You want to build an anomaly detection algorithm that filters out every point that comes from the wrong neighborhood. You have been able to isolate an X_train dataset which, you are sure, contains only houses from the College Creek area. Following the previous example, test your ability to isolate anomalies in new incoming data (X_test) with KDE.

Advanced exercise: What happens if the contamination comes from other areas? You can choose among the following names:

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

X_train, X_test, y_test = get_house_prices_data(neighborhood = 'CollgCr', anomaly_neighborhood='Veenker')
X_train
LotArea SalePrice OverallCond
0 8461 163990 5
1 10335 204000 5
2 9548 237000 6
3 9245 145000 5
4 15523 133500 6
... ... ... ...
115 9240 287000 5
116 11317 180000 5
117 4426 141000 5
118 9066 230000 5
119 7990 110000 6

120 rows × 3 columns

# Total data
train_test_data = pd.concat([X_train, X_test], ignore_index=True)
y_total = [0] * len(X_train) + y_test

fig = px.scatter_3d(train_test_data, x='LotArea', y='OverallCond', z='SalePrice', color=y_total)

fig.show()

Solution

# When the features are on very different scales, like in this case, it is often beneficial 
# to rescale them before applying any anomaly detection or clustering technique.
scaler = MinMaxScaler()
X_train_rescaled = scaler.fit_transform(X_train)
param_space = {
    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
}
params, kde = hyperopt_by_score(X_train_rescaled, param_space)
.../site-packages/sklearn/model_selection/_search.py:922: UserWarning: One or more of the test scores are non-finite: [-45.26 -42.42 ... -inf ... -179.45]
.../site-packages/sklearn/model_selection/_search.py:929: RuntimeWarning: invalid value encountered in subtract

print('Best parameters:')
for key in params:
    print('{}: {}'.format(key, params[key]))

X_test_rescaled = scaler.transform(X_test)
test_scores = -kde.score_samples(X_test_rescaled)
test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
curves = evaluate(y_test, test_scores)
Best parameters:
kernel: epanechnikov
bandwidth: 0.4

The Curse of Dimensionality

The flexibility of KDE comes at a price. The dependency on the dimensionality of the data is quite unfavorable.

Theorem [Stone, 1982] Any estimator that is consistent$^*$ for the class of all $k$-times differentiable pdfs over $\mathbb{R}^d$ has a convergence rate of at most

$$ \frac{1}{n^{\frac{k}{2k+d}}} $$

$^*$Consistency: for every pdf $p$ in the class, $\lim_{n\to\infty}\|\hat{p}_n - p\|_\infty = 0$ with probability $1$, where $\hat{p}_n$ denotes the estimate based on $n$ samples.
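To get a feeling for the scale, fix $k=2$ (twice differentiable densities). In $d=2$ the optimal rate is $n^{-2/6} = n^{-1/3}$, so reducing the error by a factor of $10$ requires roughly $10^3$ times more data; in $d=14$ the rate drops to $n^{-2/18} = n^{-1/9}$, and the same improvement requires roughly $10^9$ times more data.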

Exercise

  • The very slow convergence in high dimensions does not necessarily mean that we will see bad results in high-dimensional anomaly detection with KDE.
  • This holds especially if the anomalies lie far away from the nominal data.
  • However, when the contours of the nominal distribution are non-convex, we can run into problems.

We take a look at a higher-dimensional version of our previous data set.

dists = create_distributions(dim=3)

distribution_with_anomalies = contamination(
    nominal=dists['Sinusoidal'],
    anomaly=dists['Blob'],
    p=0.01
)

sample = distribution_with_anomalies.sample(500)

y = sample[0]
X = sample[-1]
fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y)
fig.show()
# Fit KDE on high dimensional examples 
rocs = []
auprs = []
bandwidths = []

param_space = {
        'kernel': ['gaussian'],
        'bandwidth': np.linspace(0.1, 100, 1000), # Define Search space for bandwidth parameter
    }

kdes = {}
dims = np.arange(2,16)
for d in tqdm(dims):
    # Generate d dimensional distributions
    dists = create_distributions(dim=d)

    distribution_with_anomalies = contamination(
        nominal=dists['Sinusoidal'],
        anomaly=dists['Blob'],
        p=0
    )

    # Train on clean data
    sample_train = dists['Sinusoidal'].sample(500)
    X_train = sample_train[-1].numpy()
    # Test data
    sample_test = distribution_with_anomalies.sample(500)
    X_test = sample_test[-1].numpy()
    y_test = sample_test[0].numpy()

    # Optimize bandwidth
    params, kde = hyperopt_by_score(X_train, param_space)
    kdes[d] = (params, kde)
    
    bandwidths.append(params['bandwidth'])

    test_scores = -kde.score_samples(X_test)
    test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)

    
100%|██████████| 14/14 [00:39<00:00,  2.84s/it]
# Plot cross section of pdf 
fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(15, 5))
for d, axis in tqdm(list(zip(kdes, axes.flatten()))):
    
    params, kde = kdes[d]

    lin = np.linspace(-10, 10, 50)
    grid_points = list(it.product(*([[0]]*(d-2)), lin, lin))
    ys, xs = np.meshgrid(lin, lin)
    # The score function of sklearn returns log-densities
    scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
    colormesh = axis.contourf(xs, ys, scores)
    axis.set_title("Dim = {}".format(d))
    axis.set_aspect('equal')
    

# Show the cross sections
print('Cross-section of the KDE at (0,...,0, x, y)')
plt.show()
100%|██████████| 14/14 [00:01<00:00,  8.39it/s]
Cross-section of the KDE at (0,...,0, x, y)

Robustness

Another drawback of KDE in the context of anomaly detection is that it is not robust against contamination of the training data.

Definition The breakdown point of an estimator is the smallest fraction of observations that need to be changed so that we can move the estimate arbitrarily far away from the true value.

Example: The sample mean has a breakdown point of $0$. Indeed, for a sample $x_1,\ldots, x_n$ we only need to change a single value in order to move the sample mean anywhere we want. Hence the breakdown point is at most $\frac{1}{n}$ for every sample size $n\in\mathbb{N}$, and thus $0$ in the limit.
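A quick numerical illustration (the numbers are purely illustrative): moving a single observation drags the sample mean arbitrarily far, while the median barely moves.

import numpy as np

x = np.random.default_rng(0).normal(size=100)
x_corrupted = x.copy()
x_corrupted[0] = 1e6  # move a single observation arbitrarily far away

print(np.mean(x), np.mean(x_corrupted))      # the mean jumps from ~0 to ~10000
print(np.median(x), np.median(x_corrupted))  # the median is essentially unchanged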

Robust Statistics

There are robust replacements for the sample mean:

  • Median of means: Split the dataset into $S$ equally sized subsets $X_1,\ldots, X_S$ and compute $\mathrm{median}(\overline{X_1},\ldots, \overline{X_S})$ (see the sketch after this list)
  • M-estimation: The mean in a normed vector space is the value that minimizes the sum of squared distances
    $\overline{X} = \arg\min_{y}\sum_{x\in X}\|x-y\|^2$
    M-estimation replaces the quadratic loss with a more robust loss function.
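Below is a minimal median-of-means sketch (the helper name, block count, and toy contamination are only for illustration). It stays stable as long as fewer than half of the blocks contain contaminated points.

import numpy as np

def median_of_means(x: np.ndarray, n_blocks: int, seed: int = 0) -> float:
    """Median-of-means estimate of the location of a 1-d sample."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))              # random assignment of points to blocks
    blocks = np.array_split(x[perm], n_blocks)  # S roughly equally sized subsets
    return np.median([block.mean() for block in blocks])

x = np.random.default_rng(1).normal(size=1000)
x[:5] = 1e4  # contaminate 0.5% of the sample

print(np.mean(x))                       # pulled far away from 0 by the outliers (~50)
print(median_of_means(x, n_blocks=50))  # at most 5 of the 50 blocks are affected, so the estimate stays near 0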

Huber loss

Switches from the quadratic to a linear loss beyond a prescribed threshold.

import numpy as np


def huber(error: np.ndarray, threshold: float) -> np.ndarray:
    """Huber loss
    
    @param error: base error
    @param threshold: threshold for the transition to the linear regime
    """
    test = (np.abs(error) <= threshold)
    return (test * (error**2)/2) + ((1-test)*threshold*(np.abs(error) - threshold/2))

x = np.linspace(-5, 5)
y = huber(x, 1)

plt.plot(x, y)
plt.gca().set_title("Huber Loss")
plt.show()

Hampel loss

A more complex loss function that depends on three parameters $0 < a < b < r$.

import numpy as np

def single_point_hampel(error: float, a: float, b: float, r: float):
    """Hampel loss
    
    @param error: base error
    @param a: 1st threshold parameter
    @param b: 2nd threshold parameter
    @param r: 3rd threshold parameter
    """
    if abs(error) <= a:
        return error**2/2
    elif a < abs(error) <= b:
        return (1/2 *a**2 + a* (abs(error)-a))
    elif  b < abs(error) <= r:
        return a * (2*b-a+(abs(error)-b) * (1+ (r-abs(error))/(r-b)))/2
    else:
        return a*(b-a+r)/2

hampel = np.vectorize(single_point_hampel)

x = np.linspace(-10.1, 10.1)
y = hampel(x, a=1.5, b=3.5, r=8)

plt.plot(x, y)
plt.gca().set_title("Hampel Loss")
plt.show()

KDE is a Mean

Kernel as scalar product:

  • Let $K$ be a radial monotonic$^\ast$ kernel over $\mathbb{R}^n$.
  • For $x\in\mathbb{R}^n$ let $\phi_x = K(\cdot, x)$.
  • Form a vector space from the linear span of $\{\phi_x \mid x\in\mathbb{R}^n\}$:
    • Pointwise addition and scalar multiplication.
  • Define the scalar product $\langle \phi_x, \phi_y\rangle = K(x,y)$ and extend it bilinearly.
  • Advantage: The scalar product is computable from kernel evaluations.
  • The completion of this space is called the reproducing kernel Hilbert space (RKHS) of $K$.
  • $\mathrm{KDE}_h(\cdot, D) = \frac{1}{N}\sum_{i=1}^N K_h(\cdot, x_i) = \frac{1}{N}\sum_{i=1}^N\phi_{x_i}$
    • where $K_h(x,y) = \frac{1}{h^n}K\left(\frac{\|x-y\|}{h}\right)$

$^*$All kernels that we have seen are radial and monotonic
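Because the scalar product reduces to kernel evaluations, RKHS quantities are directly computable from the data. As a small illustration (Gaussian kernel; all names are ours and purely illustrative), the squared RKHS distance between the KDE and a single feature map $\phi_y$ expands as $\langle f,f\rangle - 2\langle f,\phi_y\rangle + \langle\phi_y,\phi_y\rangle$ with $f = \frac{1}{N}\sum_i \phi_{x_i}$:

import numpy as np

def gaussian_kernel(a: np.ndarray, b: np.ndarray, h: float = 1.0) -> np.ndarray:
    """Kernel matrix with entries K(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 h^2))."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * h ** 2))

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 2))  # training sample
y = rng.normal(size=(1, 2))    # query point

K_DD = gaussian_kernel(D, D)   # <phi_xi, phi_xj>
K_Dy = gaussian_kernel(D, y)   # <phi_xi, phi_y>
K_yy = gaussian_kernel(y, y)   # <phi_y, phi_y>

# ||f - phi_y||^2 = <f, f> - 2 <f, phi_y> + <phi_y, phi_y>
dist_sq = K_DD.mean() - 2 * K_Dy.mean() + K_yy[0, 0]
print(dist_sq)

Replacing this RKHS mean by an M-estimator (e.g. with the Huber or Hampel loss) is the idea behind the robust KDE variants compared in the exercise below.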

Exercise

We compare the performance of different approaches to recovering the nominal distribution under contamination. Here, we use code by Humbert et al. to replicate the results of the referenced paper on median-of-means KDE. More details on robust KDE (rKDE) can be found in the paper by Kim and Scott.

# =======================================================
#   Parameters
# =======================================================
algos = [
    'kde',
    'mom-kde', # Median-of-Means
    'rkde-huber', # robust KDE with huber loss
    'rkde-hampel', # robust KDE with hampel loss
]

dataset = 'house-prices'
dataset_options = {'neighborhood': 'CollgCr', 'anomaly_neighborhood': 'Edwards'}

outlierprop_range = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]
kernel = 'gaussian'
auc_scores = perform_rkde_experiment(
    algos,
    dataset,
    dataset_options,
    outlierprop_range,
    kernel,
)
Dataset:  house-prices
Outlier prop: 0.01 (1 / 10)
downsample outliers
Finding best bandwidth...
Algo:  kde
Algo:  mom-kde
Algo:  rkde-huber
Stop at 4 iterations
Stop at 2 iterations
Algo:  rkde-hampel
Stop at 4 iterations
Stop at 100 iterations

[... analogous output for the remaining outlier proportions 0.02 - 0.5 ...]
fig, ax = plt.subplots(figsize=(7, 5))
for algo, algo_data in auc_scores.groupby('algo'):
    # average the anomaly-detection AUC over repetitions for each outlier proportion
    mean_auc = algo_data.groupby('outlier_prop')['auc_anomaly'].mean()
    ax.plot(mean_auc.index, mean_auc.values, 'o-', label=algo)
plt.legend()
plt.xlabel('outlier_prop')
plt.ylabel('auc_score')
plt.title('Comparison of rKDE against contamination')
Text(0.5, 1.0, 'Comparison of rKDE against contamination')

Try using different neighborhoods for contamination. Which robust KDE algorithm performs better overall? Choose among the following options:

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

You can also change the kernel type: gaussian, tophat, epanechnikov, exponential, linear, or cosine.

Summary

  • Kernel density estimation is a non-parametric method to estimate a pdf from a sample.
  • The bandwidth is the most important parameter.
  • The estimate converges to the true pdf as $n\to\infty$.
    • However, the sample size needed for a given accuracy grows exponentially with the dimension.
  • KDE is sensitive to contamination:
    • In a contaminated setting one can employ methods from robust statistics to obtain robust estimates.

Implementations
